ABSTRACT

A catalog clearinghouse is defined as an electronic 
commerce entity that provides a common web-based 
storefront to a group of merchants. This paper considers the
implementation of a specific kind of clearinghouse: one 
that hosts catalogs for a large number of similar merchants 
(e.g., thousands of restaurant menus). A standard 
architecture is described, followed by a novel architecture 
that uses the emerging Extensible Markup Language
(XML) and Extensible Stylesheet Language (XSL) 
standards for data storage, search and graphical 
presentation. 

For large clearinghouses, the new architecture realizes 
significant benefits in terms of presentation flexibility, 
powerful search capabilities and increased performance. 
XML and XSL are used at the client side for flexible 
presentations with low maintenance costs, while XML and 
traditional RDBMS techniques are combined at the server 
for searching and business logic. Detailed examples of 
XML document type definitions and XML structured data
are given, along with details on the recursive 
transformation of XML into HTML, using rule-based XSL 
style sheets. 

1. INTRODUCTION 

As e-commerce on the WWW intensifies, a number of
meta-businesses have emerged. Known by various names
such as "virtual mall", such businesses provide a single
access point for groups of merchants. Customers access the 
"catalog clearinghouse" web site to browse catalogs of 
merchant companies that deliver goods and services. The 
clearinghouse handles ordering and payment and transmits 
orders to merchants for fulfillment. This enables customers 
to search for goods, compare prices, etc., at a central 
location instead of visiting multiple dislocated web sites. 

The clearinghouse model is also attractive to small businesses
for which the benefits of maintaining an individual web site may
never justify the overhead.

The model requires little technical expertise on the part of the
merchant, and little modification to standard business procedure. 
For instance, orders can be forwarded from the clearinghouse 
via fax or telephone. Restaurants fit this category, and it is 
therefore natural that clearinghouses have appeared which
specialize solely in restaurants offering take-out or home 
delivery services. Among these, Cyberchefi [12] and 
Cybermeals [13] include thousands of restaurants. In this paper, 
a restaurant clearinghouse is used as an illustrative example. 

A look at existing clearinghouses, large and small, reveals two 
broad implementation strategies:

1. The clearinghouse maintains a set of customized web pages
for each merchant.

2. The clearinghouse maintains a database containing catalog 
information for each merchant. When a merchant's page is
accessed, its web pages are generated dynamically. 

For large clearinghouses that store several thousand catalogs
(e.g., Cybermeals targets 25,000 restaurants by the end of 1998),
only the second approach is feasible. Maintaining and updating
thousands of customized web forms (e.g. to make price and 
product changes) would be prohibitive, even if custom web 
pages were initially created for each business. However, the 
second approach has certain drawbacks: 

1. All catalogs have an identical "look and feel" because
catalog web pages are programmatically generated. This makes
it hard for each merchant to establish an individual sales 
identity. 

2. Catalog search capabilities are limited and inefficient.
Consider the case of a restaurant clearinghouse that allows each
restaurant to list the chief ingredients of its dishes, along with
nutrition information such as calories, fat grams, diabetic
information, etc. A simple keyword search may not accurately
target "a Chinese dish with chicken and snow peas, without
peanuts or MSG, with at most 25 grams of fat, within 45
minutes delivery time of Harvard Square in Boston."

If such a query is allowed, other problems arise. In any
normalized relational database schema [10] constructed to store
such information, this query will require a costly relational
join of several tables. A large clearinghouse may expect
thousands of such queries per minute at peak periods.
Finally, while relational databases can easily search for
expressions containing comparison and Boolean operators,
they are cumbersome for searches dealing with hierarchical
relationships between fields such as "inside/outside",
"above/below", and transitive compositions of such
relationships.

This paper presents an architecture, or implementation 
model, that uses Extensible Markup Language (XML) [5]
and Extensible Stylesheet Language (XSL) [8] to overcome 
the above problems. These are web-based structured 
document standards that are being developed based on 
Standard Generalized Markup Language (SGML)
[15, 18].

The architecture presented allows a clearinghouse to have a
unique presentation format for each merchant's catalog, 
without the overhead of maintaining individual customized 
web pages. By moving formatting tasks to the client 
(browser), the scheme provides load balancing that reduces 
web server workload. Furthermore, expressive and efficient 
catalog searches are made possible. Finally, the fault
tolerance that is realized by using a relational database to 
drive the business logic (including transaction processing 
for ordering, payment and delivery) is largely left 
undisturbed. 

The following section contains a brief introduction to XML 
and its companion, XSL. After that, the paper describes the
catalog clearinghouse architecture in detail. Examples of 
XML document type definitions and structure are given, 
along with examples of the recursive transformation of
XML into ordinary HTML using XSL style sheets. The 
paper concludes with an overview of software tools for 
implementation and avenues for future work. 

2. MARKUP AND STYLESHEET LANGUAGES 

2.1 XML for Structured Document Representation 

Standard Generalized Markup Language (SGML) is a 
document representation standard developed by the 
International Organization for Standardization, published as 
ISO 8879 [18]. It is a meta-language for specifying the 
syntax of domain-specific markup languages. HyperText
Markup Language (HTML) is one application of SGML.

SGML contains features that are difficult to learn and 
implement, although they are not commonly used in 
practice. Therefore, the World Wide Web Consortium 
(W3C) has constructed a version of SGML suitable for web
use, namely XML [5]. Differences between XML and
SGML include the following: 

• XML does not allow exceptions (inclusions and
exclusions) in element content models.

• XML does not have the "&" operator that denotes
elements which are required but can appear in any order.

• XML tags for elements with non-empty content
models must come in pairs, e.g. <a>...</a>. Tags with empty
content models have the special form <a/> (a trailing "/" after
the tag name).

• XML does not include abbreviation techniques (tag
minimization, tag omission).
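
The pairing rule in the list above can be observed with any XML parser. The following is a minimal sketch using the standard xml.etree.ElementTree module of Python (the choice of Python is an assumption made for illustration; the helper name is_well_formed is hypothetical and not part of the paper).

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Paired tags and the empty-element form <lowfat/> are both accepted.
paired = "<item><description>shrimp platter</description><lowfat/></item>"

# SGML-style tag omission (no closing </description>) is rejected in XML.
omitted = "<item><description>shrimp platter<lowfat/></item>"

print(is_well_formed(paired))
print(is_well_formed(omitted))
```

The check covers well-formedness only; it says nothing about whether a document conforms to a particular DTD.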

These changes simplify XML at the cost of expressive power.
For instance, exceptions are a powerful expressive feature of
SGML, but are difficult for both human readers and automatic
parsers to handle because they introduce ambiguity ([22]
describes techniques for modeling and understanding exceptions
and presents a case for their inclusion in XML).

The structure of XML documents can be described using 
Document Type Definitions (DTD), which are specified in 
Extended Backus-Naur Form [2]. A DTD is similar to a context 
free grammar, so standard parsing techniques from compiler 
theory apply. XML documents can conform to a DTD. They are
information rich in the sense that all semantic text is
appropriately marked up. By contrast, semantic information in
HTML documents has to be extracted from unstructured 
paragraphs, using natural language processing or heuristic 
techniques based on keyword positioning. 

XML documents generally don't contain presentation 
information. This is stored elsewhere using a system of 
structure-based presentation rules (explained below). The clear 
identification of semantic information via markup elements 
makes it possible to have expressive user interfaces for 
searching, e.g. [3], and allows for powerful indexing schemes for
quickly searching very large textual databases, e.g. [3, 6, 16].

Both Microsoft and Netscape have announced that version 5 of
their respective browsers will contain XML processors, which
are software components that parse XML documents and make
the resulting tree structures available for manipulation in code. 
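
As an illustration of what "tree structures available for manipulation in code" can look like, here is a minimal sketch that parses a fragment of the restaurant data used later in the paper and walks the resulting tree. It uses Python's standard xml.etree.ElementTree module as a stand-in for a browser's XML processor; the function name outline is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<restaurant>
  <name>Toad Hall Cafe</name>
  <menu>
    <entrees>
      <item>
        <description>grilled duck breast</description>
        <price>12.75</price>
        <lowfat/>
      </item>
    </entrees>
  </menu>
</restaurant>
"""


def outline(element, depth=0):
    """Return one line per element, indented by its depth in the tree."""
    text = (element.text or "").strip()
    line = "  " * depth + element.tag + (": " + text if text else "")
    lines = [line]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
print("\n".join(outline(root)))
```

Because every value is reached through its markup element, no heuristics are needed to tell a price from a description.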

2.2 XSL for Transformation and Rendering of XML

The W3C has specified an Extensible Stylesheet Language
(XSL) [8] for presentation and display of XML documents. XSL
can rearrange document structure, making it strictly more
powerful than Cascading Style Sheets [22], which can only
specify how markup elements are rendered (font, color etc.)
without allowing element reordering. 

XSL works by recursively transforming XML documents into 
other formats such as HTML, according to a set of style rules
augmented with a scripting language. XSL is based in part on a
styling language for SGML, Document Style Semantics and
Specification Language (DSSSL) [19]. The transformation
process can take place inside an XML-enabled browser, or in an 
applet or script run by a browser without XML capabilities. 

3. ARCHITECTURES FOR CLEARINGHOUSES

Two architectures are described. The first is a conventional
implementation model for clearinghouses, which can be applied
using established web development techniques. The second is a 
new architecture. 
3.1 The Traditional Web Development Model

The main components in a traditional implementation of a 
catalog clearinghouse will generally include the users' web 
browsers (clients), the clearinghouse web server, an
RDBMS, a fulfillment agent and the merchants' order
acceptance points (telephones, fax machines, e-mail
addresses, etc.). The user must first locate a particular
catalog by entering keywords into a search form, which is
posted to the web server. The server formats one or more
SQL SELECT queries based on the search terms. The
queries are sent to the RDBMS, where their execution will
typically require relational joins on a number of tables in
the database. Query results are returned to the web server. 

Using a sequence of such queries, the web server 
determines the number of matching catalogs. If zero or
multiple catalogs match, the user must continue searching.
Once a single catalog is identified, the web server
constructs an HTML order form dynamically. This action
can be expensive. For instance, a restaurant menu involves
many items in multiple categories (appetizers, entrees,
desserts, beverages). The server must execute nested loops
to construct an order form, usually embedded inside HTML 
tables. The order form is sent to the user's browser. 
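
The nested-loop construction described above can be sketched as follows. This is an illustrative Python sketch only: the row layout (category, description, price) and the function name build_order_form are assumptions, and a real order form would also carry input fields for quantities.

```python
from html import escape

# Hypothetical rows as they might come back from the SQL SELECT queries:
# (category, description, price)
ROWS = [
    ("appetizers", "cheese fritters", "4.99"),
    ("appetizers", "shrimp platter", "5.49"),
    ("entrees", "grilled duck breast", "12.75"),
]


def build_order_form(rows):
    """Build an HTML table with a loop over categories and a loop over items."""
    categories = []
    for category, _, _ in rows:
        if category not in categories:
            categories.append(category)
    parts = ["<TABLE>"]
    for category in categories:                   # outer loop: menu sections
        parts.append("<TR><TH>%s</TH></TR>" % escape(category))
        for cat, description, price in rows:      # inner loop: items
            if cat == category:
                parts.append("<TR><TD>%s</TD><TD><B>%s</B></TD></TR>"
                             % (escape(description), escape(price)))
    parts.append("</TABLE>")
    return "\n".join(parts)


form = build_order_form(ROWS)
print(form)
```

All of this work runs on the server for every page request, which is the cost the XML-based architecture aims to move to the client.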

The user completes the order form and submits it to the 
web server. The server constructs additional SQL queries
(which may now include SELECT, INSERT and
UPDATE) that enter the order into the RDBMS. At this
point, a fulfillment agent is responsible for transmitting the
order to the appropriate merchant. If orders must be
fulfilled in real-time (e.g., restaurants), the web server itself
may act as the fulfillment agent, perhaps by relaying the
order to a fax machine. In other cases (e.g., mail-order
catalogs) a separate process can periodically extract new
orders from the database for transmission to merchant order 
acceptance points. 
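
A minimal sketch of order entry and periodic extraction by a fulfillment agent is given below, using Python's standard sqlite3 module as a stand-in for the RDBMS. The table layout, the column names and the functions place_order and extract_new_orders are all hypothetical; the paper does not specify a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " catalog_id TEXT NOT NULL,"
    " item_id TEXT NOT NULL,"
    " sent INTEGER NOT NULL DEFAULT 0)"
)


def place_order(conn, catalog_id, item_id):
    """Enter an order inside a transaction and return the new order id."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO orders (catalog_id, item_id) VALUES (?, ?)",
            (catalog_id, item_id),
        )
    return cur.lastrowid


def extract_new_orders(conn):
    """Return orders not yet transmitted and mark them as sent."""
    with conn:
        rows = conn.execute(
            "SELECT id, catalog_id, item_id FROM orders "
            "WHERE sent = 0 ORDER BY id"
        ).fetchall()
        conn.executemany(
            "UPDATE orders SET sent = 1 WHERE id = ?",
            [(row[0],) for row in rows],
        )
    return rows


place_order(conn, "toad-hall-cafe", "entrees-1")
place_order(conn, "toad-hall-cafe", "appetizers-2")
batch = extract_new_orders(conn)
print(batch)
```

A separate process would call extract_new_orders on a timer and relay each returned row to the merchant's order acceptance point.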

A complete e-commerce implementation involves details
not included in the above outline, including user session
management (with HTTP cookies or hidden form fields),
secure payment and transaction processing for
fault-tolerance.

3.2 A New Architecture Based on XML and XSL

Representing catalog data using XML leads to an alternate 
implementation architecture, which offers the benefits
outlined in the introduction. The new architecture extends
the conventional one, and is outlined in Figure 1. Two
databases are now used, an XML database for catalog 
information, and an RDBMS for order processing. 

The workings of the new architecture are sketched in
Figure 2. In step (1), the user submits a search form to the
web server. Search interfaces for structured text documents
can be quite powerful. For instance, the user can specify
expressions with Boolean and comparison operators as well
as hierarchical and containment restrictions, restricting
expressions to only match within certain markup-element
contexts. These contextual restrictions are much more
precise than the word-distance based searches offered by typical
web search engines. Again, although Boolean and comparison-based
searches can be done with relational databases, they are
likely to be less efficient than ones on text databases, which are
indexed differently.
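
To make the idea of contextual restrictions concrete, the sketch below combines a containment restriction (items inside one menu section), a structural test (presence of a low-fat flag) and a comparison on price. It uses the limited XPath support in Python's standard xml.etree.ElementTree module rather than an indexed text database, so it illustrates expressiveness only, not performance. The function name search and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Fragment of a restaurant document (address omitted for brevity).
SAMPLE = """
<restaurant>
  <name>Toad Hall Cafe</name>
  <menu>
    <appetizers>
      <item><description>cheese fritters</description><price>4.99</price></item>
      <item><description>shrimp platter</description><price>5.49</price><lowfat/></item>
    </appetizers>
    <entrees>
      <item><description>grilled duck breast</description><price>12.75</price><lowfat/></item>
    </entrees>
    <desserts/>
    <beverages/>
  </menu>
</restaurant>
"""


def search(root, section, max_price, lowfat_only=False):
    """Find item descriptions in one menu section, filtered by price and flag."""
    # Containment restriction: only items nested inside the given section.
    path = ".//%s/item" % section
    if lowfat_only:
        path += "[lowfat]"  # structural test: the item has a <lowfat/> child
    hits = []
    for item in root.findall(path):
        # Comparison restriction on the numeric value of <price>.
        if Decimal(item.findtext("price")) <= Decimal(max_price):
            hits.append(item.findtext("description"))
    return hits


root = ET.fromstring(SAMPLE)
print(search(root, "appetizers", "6.00"))
print(search(root, "appetizers", "6.00", lowfat_only=True))
```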

The completed search form is sent to the web server (2). The 
web server formulates a series of queries to the XML database 
(3), where a search occurs (4), eventually locating a single
catalog (as before, the user may need to be consulted again if a
unique catalog is not located). The catalog has XML and XSL
components. These are returned to the web server (5), which
does only minimal processing (6), e.g. to add user session
information. The data is sent to the client browser (7), which
transforms the XML into HTML by applying the XSL rule set
(8). An example of this transformation is shown in section 3.3.

The browser now renders the catalog order form for the user. At 
this point, the process continues much as in the traditional 
model, using an RDBMS and a fulfillment agent (not shown). It
is important to continue to use an RDBMS here because
fault-tolerance is especially crucial once an order is placed.
Transaction processing systems have a clear model of failure
events and a well-established track record of fault-tolerance.

In the new architecture, the fulfillment agent may query both the
XML database and the SQL database for additional information
(e.g. merchant address). This is because the order information
entered into the RDBMS will only store catalog and item
identifiers, to avoid duplicating data that is in the XML database
(because duplicating information would violate general database
design principles).
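
The division of data between the two stores can be sketched as follows. The order row holds only identifiers, and the fulfillment agent resolves them against the XML catalog when it needs the merchant address or item details. Everything specific here is an assumption for illustration: the in-memory dictionary standing in for the XML database, the use of a section name plus a 1-based position as the item identifier, and the function name describe_order.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Stand-in for the XML database: catalog identifier -> XML document.
# The address is abbreviated; state, zip and phone are left out.
CATALOGS = {
    "toad-hall-cafe": """
<restaurant>
  <name>Toad Hall Cafe</name>
  <address><street>116 Old Canton Road</street><city>Boston</city></address>
  <menu>
    <entrees>
      <item><description>grilled duck breast</description><price>12.75</price></item>
    </entrees>
  </menu>
</restaurant>
""",
}

# Stand-in for the RDBMS: the order stores identifiers only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY,"
    " catalog_id TEXT, section TEXT, position INTEGER)"
)
conn.execute(
    "INSERT INTO orders (catalog_id, section, position) VALUES (?, ?, ?)",
    ("toad-hall-cafe", "entrees", 1),
)


def describe_order(conn, order_id):
    """Join an order row with the catalog data it refers to."""
    catalog_id, section, position = conn.execute(
        "SELECT catalog_id, section, position FROM orders WHERE id = ?",
        (order_id,),
    ).fetchone()
    root = ET.fromstring(CATALOGS[catalog_id])
    item = root.find("./menu/%s/item[%d]" % (section, position))
    return {
        "merchant": root.findtext("name"),
        "street": root.findtext("address/street"),
        "description": item.findtext("description"),
        "price": item.findtext("price"),
    }


print(describe_order(conn, 1))
```

Positional identifiers break if a merchant reorders the menu, so a production design would more likely give each item a stable identifier.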

The benefits of the new implementation can now be reviewed.
First, the web server's load is significantly reduced because
formatting is done in the browser. Next, since each XML catalog
may have a different XSL rule set, catalogs are rendered
uniquely. In addition, powerful search techniques drawing on a
long history of SGML research can be applied to perform
searches that are easier to express and more efficient than is
possible with relational data. For instance, XML-QL [14] allows
for the specification of queries with conditions, structured text
"joins", path regular expressions to handle hierarchy, nested
queries, etc. Because XML-QL itself uses a syntax similar to
that of XML documents, it is easier to express structured text
queries in a language like XML-QL than in SQL.

Finally, it should be noted that the architecture can be modified
to effect the XML to HTML transformation on the server, using
transformation software such as XT [9]. While placing an
additional load on the server, this solution can generate
browser-independent HTML. This would allow developers to deploy
solutions based on this architecture without waiting for
XML/XSL-enabled browsers to become commonplace.

3.3 Sample XML and XSL Representations

In this section, sample XML and XSL data for a "restaurant
clearinghouse" are shown. These are simplified for clarity and
brevity, e.g., the restaurant DTD shown only stores one
restaurant menu instead of many. The DTD for a restaurant
includes its address and its menu. The menu has sections for
appetizers, entrees, desserts and beverages.
Figure 1: Clearinghouse Architecture Outline

Figure 2: Clearinghouse Architecture Detail



Each section contains one or more menu items. A
restaurant must offer at least one entrée, but may offer zero
or more items in each of the other categories. A menu item
consists of a description, a price and a "low fat" flag. An
XML DTD for these conditions is shown in Figure 3. The EBNF
notation uses a combination of common regular expression
sequencing operators and context-free grammar productions. For
instance, the operators "*", "+" and "?" denote repetitions of
"zero or more", "one or more" and "zero or one" times
respectively; the "|" operator denotes "or" (alternation) and
the "," operator denotes sequencing. "#PCDATA" denotes
(essentially) XML text.
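
Because the content-model operators are regular expression operators, a content model can be checked with an ordinary regular expression over the sequence of child element names. The sketch below does this for the menu part of the restaurant DTD. It is an illustration in Python, not a validating parser: the content models are translated to regular expressions by hand, elements with #PCDATA or EMPTY content are not checked, and the names CONTENT_MODELS and violations are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the restaurant DTD, rewritten by hand as regular
# expressions over a sequence of child names, each followed by a space.
CONTENT_MODELS = {
    "menu": r"appetizers entrees desserts beverages ",
    "appetizers": r"(item )*",                 # zero or more
    "entrees": r"(item )+",                    # one or more
    "desserts": r"(item )*",
    "beverages": r"(item )*",
    "item": r"description price (lowfat )?",   # lowfat is optional
}


def violations(element):
    """Return tags of elements whose children do not match their model."""
    bad = []
    model = CONTENT_MODELS.get(element.tag)
    if model is not None:
        children = "".join(child.tag + " " for child in element)
        if re.fullmatch(model, children) is None:
            bad.append(element.tag)
    for child in element:
        bad.extend(violations(child))
    return bad


valid = ET.fromstring(
    "<menu><appetizers/><entrees><item><description>grilled duck breast"
    "</description><price>12.75</price><lowfat/></item></entrees>"
    "<desserts/><beverages/></menu>"
)
# A menu with no entree breaks the (item+) content model of <entrees>.
invalid = ET.fromstring(
    "<menu><appetizers/><entrees/><desserts/><beverages/></menu>"
)

print(violations(valid))
print(violations(invalid))
```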

Figure 4 shows fragments of a data file conforming to the
DTD. Note that all tags are properly nested and appear in
pairs, except the "<lowfat/>" tag which has a content model
of EMPTY in the DTD, and hence can appear alone with
the trailing "/". The file is indented for clarity, but in
practice is stored according to XML rules for processing
whitespace.

Figure 5 shows a minimal XSL style sheet for transforming
the above XML data into ordinary HTML. The style sheet
specifies four transformation rules. Each transformation
rule contains a pattern and an action. The XML document's
natural parse tree structure is traversed to find nodes that
match the pattern part of a rule. At matching nodes, the
action part is used to derive a transformed sub-tree, which
is attached at the current node (replacing the one already
there). This process continues recursively until no patterns
match. Therefore, transformations can be specified
compactly without resorting to looping constructs.
Furthermore, the W3C design goals state that XSL should
leverage the power of scripting languages. This means that
programmers more comfortable with a procedural paradigm
will be able to use loops to effect the above transformation.
Scripting languages that make tree data available to the 
programmer for use in calculations, decisions and loops can 
perform arbitrary restructuring. 
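
The pattern/action mechanism can be imitated in a few lines of ordinary code. The following Python sketch mirrors the four rules of Figure 5 (root, item, description, price) with a table of actions keyed by element name, plus a default action that simply processes children. It is a simplified analogue written for illustration, not an XSL processor: patterns are plain tag names, and the names RULES, transform, process_children and transform_document are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

SAMPLE = """
<restaurant>
  <name>Toad Hall Cafe</name>
  <menu>
    <appetizers>
      <item><description>cheese fritters</description><price>4.99</price></item>
      <item><description>shrimp platter</description><price>5.49</price><lowfat/></item>
    </appetizers>
    <entrees>
      <item><description>grilled duck breast</description><price>12.75</price><lowfat/></item>
    </entrees>
  </menu>
</restaurant>
"""


def process_children(node):
    """Default action: emit the node's own text, then transform each child."""
    out = [escape((node.text or "").strip())]
    for child in node:
        out.append(transform(child))
    return "".join(out)


# Pattern -> action. Each action returns the replacement for a matching node.
RULES = {
    "item": lambda n: "<TR>" + process_children(n) + "</TR>",
    "description": lambda n: "<TD>" + process_children(n) + "</TD>",
    "price": lambda n: "<TD><B>" + process_children(n) + "</B></TD>",
}


def transform(node):
    """Apply the rule matching this node, or fall back to its children."""
    action = RULES.get(node.tag, process_children)
    return action(node)


def transform_document(root):
    """Root rule: wrap the restaurant name and menu in an HTML page."""
    name = transform(root.find("name"))
    menu = transform(root.find("menu"))
    return ("<HTML><HEAD><TITLE>%s</TITLE></HEAD><BODY><H1>Menu for %s</H1>"
            "<HR/><BR/><TABLE>%s</TABLE></BODY></HTML>" % (name, name, menu))


page = transform_document(ET.fromstring(SAMPLE))
print(page)
```

No rule mentions the menu sections, so items from all categories fall through the default action and end up in a single table, as in the paper's example.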

The style sheet in Figure 5 is hand-coded according to the 
current W3C working draft for XSL [8], and tested using 
XT [9], a Java-based transformation engine for XSL. 
Earlier versions were tested in Microsoft's Technology 
Preview XSL processor [24], an ActiveX control. The style
sheet transforms a restaurant XML document into a simple
HTML file where all menu items (from all categories
joined together) are listed as an HTML table. HTML tags 
embedded in the XSL rules are shown in uppercase for 
clarity. The style sheet in Figure 5 contains a total of four 
rules. 

While XSL style sheets are not very amenable to hand 
coding, the abundance of graphical development tools
indicates that style sheet generators will soon be routinely
available (some already exist, e.g. [25]). Once a style sheet
is created, a transformation engine such as XT [9]
recursively applies the XSL rules to transform the sample
XML document into HTML as shown in Figure 6.

4. IMPLEMENTATION APPROACHES AND 
FUTURE WORK 

The XML-based clearinghouse architecture is designed to 
be implemented efficiently using primarily off-the-shelf 
components. Available components range from inexpensive 
choices for prototyping, to more expensive choices for 
deployment. Many XML and XSL resources are listed in
[11] (an excellent starting point for learning XML). Many
software tools for structured documents are also listed at
[20]; space limits mention here to a few.

Web server activities can take place in-process or out-of- 
process. Within the server process, server-side scripting and 
dynamically linked libraries (i.e., server Application Program 
Interfaces) provide choices between simplicity (scripting) and 
performance (APIs). Alternately, the server can initiate 
additional worker processes using the older Common Gateway
Interface (CGI/bin) techniques. Server-side scripting is a good
compromise between efficiency and complexity, and is
supported by leading web servers, including Apache (PHP/FI
scripting), Microsoft (VBScript/JScript) and Netscape
(JavaScript).

RDBMS choices range from simply placing the database on the 
web server's host, where the server accesses it directly using
the appropriate libraries (e.g. ODBC compliant drivers), to 
multi-tiered solutions involving separate servers. A separate 
RDBMS server is preferable for many reasons. Choices range 
from the inexpensive MySQL to high-end enterprise servers 
from Oracle, IBM, Microsoft etc., which support transaction
processing. A wide variety of e-commerce code is available to
implement ordering and payment. 

Tools for XML and XSL are rapidly appearing, based on 
extensive SGML development. Search engines like PAT [16]
have shown that high performance is achievable on very large 
databases. A public domain choice is SARA [6]. These search 
engines have been extensively used with enormous textual 
databases and offer the advanced search features described 
earlier.

Intriguing techniques blending relational and structured-text 
theory are emerging, which should lead to additional choices for 
XML databases. In addition, new query languages have 
appeared, e.g. SGML-QL [17] and XML-QL [14]. Search
engines for these should follow. Also, XML processors in Java
and other languages [11] are suitable for developing prototype
search engines (currently, Java run-time performance may be
inadequate for high-volume production use at the server side).

XSL software suitable for prototyping and development (but not 
for deployment, yet) is available. These include XML Styler 
[25] for XML-based style sheet generation and the XT processor 
[9] for XSL transformations. Other Java-based XSL tools [11] 
and SGML styling tools [20] are available as well.

For this paper, work is under way on a prototype 
implementation using some of the tools mentioned above, and
on additional design issues. The latter include enabling 
merchants to update their data and presentation online (probably 
using Java), developing flexible web search interfaces based on 
current SGML interfaces or Java displets [7], tracking customer 
information, etc. 

Additional work is also needed on applying database validation 
rules and type checking to XML's textual data, and efficiently
performing SQL-like aggregate queries on XML data for reports
and data mining. These problems will probably involve the
application of techniques integrating relational and structured 
data or relational variants [1,4]. 
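
The two open problems named above can be illustrated on a small scale. The sketch below type-checks the textual content of price elements and then computes an average price per menu section, a rough analogue of an SQL AVG with GROUP BY. It is an in-memory Python illustration only and does not address efficiency on large collections; the function names parse_price and average_price_by_section are hypothetical.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation

SAMPLE = """
<restaurant>
  <name>Toad Hall Cafe</name>
  <menu>
    <appetizers>
      <item><description>cheese fritters</description><price>4.99</price></item>
      <item><description>shrimp platter</description><price>5.49</price><lowfat/></item>
    </appetizers>
    <entrees>
      <item><description>grilled duck breast</description><price>12.75</price><lowfat/></item>
    </entrees>
    <desserts/>
    <beverages/>
  </menu>
</restaurant>
"""


def parse_price(text):
    """Type-check price text: return a Decimal or raise ValueError."""
    try:
        value = Decimal(text.strip())
    except (InvalidOperation, AttributeError):
        raise ValueError("not a decimal price: %r" % (text,))
    if not value.is_finite() or value < 0:
        raise ValueError("price out of range: %r" % (text,))
    return value


def average_price_by_section(root):
    """Average item price for each non-empty menu section."""
    averages = {}
    for section in root.find("menu"):
        prices = [parse_price(item.findtext("price"))
                  for item in section.findall("item")]
        if prices:
            averages[section.tag] = sum(prices) / len(prices)
    return averages


averages = average_price_by_section(ET.fromstring(SAMPLE))
print(averages)
```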
<?xml version="1.0" ?>
<!DOCTYPE restaurant [
<!ELEMENT restaurant   (name, address, menu)>
<!ELEMENT name         (#PCDATA)>
<!ELEMENT address      (street, city, (state | region), zip, phone)>
<!ELEMENT menu         (appetizers, entrees, desserts, beverages)>
<!ELEMENT street       (#PCDATA)>
<!ELEMENT city         (#PCDATA)>
<!ELEMENT state        (#PCDATA)>
<!ELEMENT region       (#PCDATA)>
<!ELEMENT zip          (#PCDATA)>
<!ELEMENT phone        (#PCDATA)>
<!ELEMENT appetizers   (item*)>
<!ELEMENT entrees      (item+)>
<!ELEMENT desserts     (item*)>
<!ELEMENT beverages    (item*)>
<!ELEMENT item         (description, price, lowfat?)>
<!ELEMENT description  (#PCDATA)>
<!ELEMENT price        (#PCDATA)>
<!ELEMENT lowfat       EMPTY>
]>

Figure 3: Sample XML DTD



<restaurant>
  <name>Toad Hall Cafe</name>
  <address>
    <street>116 Old Canton Road</street>
    <city>Boston</city>
    ( ... state, zip and phone ... )
  </address>
  <menu>
    <appetizers>
      <item>
        <description>cheese fritters</description>
        <price>4.99</price>
      </item>
      <item>
        <description>shrimp platter</description>
        <price>5.49</price>
        <lowfat/>
      </item>
    </appetizers>
    <entrees>
      <item>
        <description>grilled duck breast</description>
        <price>12.75</price>
        <lowfat/>
      </item>
    </entrees>
    ( ... desserts and beverages ... )
  </menu>
</restaurant>



Figure 4: Sample XML Document 
<xsl:stylesheet result-ns="html">
  <xsl:template match="/">
    <HTML>
      <HEAD>
        <TITLE>
          <xsl:process select="restaurant/name"/>
        </TITLE>
      </HEAD>
      <BODY>
        <H1>
          Menu for
          <xsl:process select="restaurant/name"/>
        </H1>
        <HR/>
        <BR/>
        <TABLE>
          <xsl:process select="restaurant/menu"/>
        </TABLE>
      </BODY>
    </HTML>
  </xsl:template>
  <xsl:template match="item">
    <TR>
      <xsl:process-children/>
    </TR>
  </xsl:template>
  <xsl:template match="description">
    <TD>
      <xsl:process-children/>
    </TD>
  </xsl:template>
  <xsl:template match="price">
    <TD>
      <B>
        <xsl:process-children/>
      </B>
    </TD>
  </xsl:template>
</xsl:stylesheet>

Figure 5: XSL Style Sheet



<HTML>
  <HEAD>
    <TITLE>
      Toad Hall Cafe
    </TITLE>
  </HEAD>
  <BODY>
    <H1>
      Menu for Toad Hall Cafe
    </H1>
    <HR/>
    <BR/>
    <TABLE>
      <TR>
        <TD>
          cheese fritters
        </TD>
        <TD>
          <B>
            4.99
          </B>
        </TD>
      </TR>
      <TR>
        <TD>
          shrimp platter
        </TD>
        <TD>
          <B>
            5.49
          </B>
        </TD>
      </TR>
      <TR>
        <TD>
          grilled duck breast
        </TD>
        <TD>
          <B>
            12.75
          </B>
        </TD>
      </TR>
    </TABLE>
  </BODY>
</HTML>

Figure 6: XML Transformed to HTML 